
x86 normally has 8 32-bit GPRs

x86 + MMX has 8 32-bit GPRs and 8 64-bit MMX regs

x86 + SSE has 8 32-bit GPRs and 8 128-bit XMM regs

x64 has 16 64-bit GPRs and 16 128-bit XMM regs

note that XMM/MMX regs cannot be used to address memory, but they're
perfectly fine for ALU ops (in a sense: they work in banks that
operate in parallel on their subcomponents, which is a bit odd)

they also make lovely "level-0" spill slots. consider that there are
effectively 16 (MMX) or 32 (SSE) GPRs of data available here, and on
modern core chips they're all single-cycle access!


So ... hmm. You can design a *really* fast fastcall with this. 

MMX is 10 years old (1997). Let's assume anyone who gives a damn has
MMX. ELF showed up in the system 5 ABI from the same year, so
... seriously. It's supported. So you can at bare minimum treat each
MMX reg as its own independent ALU GPR and ignore the high doubleword.

(PAND, POR, PXOR, PADDD, PSUBD, PCMPEQD, PCMPGTD, PSRAD, PSRLD, PSLLD)

so that buys us 8 more GPRs most of the time. we can probably tolerate
simulating 64 bit ints using "2 32-bit GPRs"; it's not the most
efficient use of the hardware on your desk, but it's easier to
generate code that way, fewer special cases. when you're actually in
64-but mode we can scale the assumptions up, everything doubles.

notes on GPR constraints / uses:

EAX  - GPR w/ subregs. used for return value in most call conventions.
EBX  - GPR w/ subregs. used for GOT pointer in ELF.
ECX  - GPR w/ subregs.
EDX  - GPR w/ subregs.

EBP  - GPR, named for base pointer, not needed since we know stack sizes
ESI  - GPR.
EDI  - GPR.
ESP  - GPR but almost always stack pointer. reserved?

MMX0 .. MMX7 - GPR (with restrictions)

theoretically the sky is the limit for using these. That's 16 32-bit
GPRs on *most* x86 machines we're likely to encounter. we are using
stack frames in heap segments so we always have to open-code our own
stack fiddling code. which is fine.

ok fine, we need a real register allocator and register-heavy calling
convention for that.

what's a nice way of modelling our needs? we have a few
pseudo-variables (current process, current stack frame, current
environment closure, current yield and return addresses) we probably
need frequent access to, but do we just feed them all into a standard
reg allocator and let it do its work? let's try that.

feels like we won't be using call or ret at all, just jmp. that's
fine. so a frame has a yield address and a ret address, and inside an
iterator a yield?/yield!/yield+/yield* operation passes the *caller's*
yield address up to the inner func; this is effectively a tail-yield
(temporarily) until the inner func completes yielding and returns.

Tail calls are done with become/become?/become!/become*/become+: if I
become another func/func?/func!/func*/func+ call, my frame is
destroyed and both yield address and return address are forwarded to
the callee.

